Run the following cell to load the packages we are going to use.

In [118]:
import numpy as np
from emo_utils import *
import emoji
import matplotlib.pyplot as plt

%matplotlib inline

Baseline model: Emojifier-V1

1 - Dataset EMOJISET

Let's start by building a simple baseline classifier.

We have a tiny dataset (X, Y) where:

  • X contains 132 sentences (strings).
  • Y contains an integer label between 0 and 4, corresponding to an emoji for each sentence.

[Figure 1: EMOJISET, a classification problem with 5 classes (data_set.png)]

Let's load the dataset using the code below. We split the dataset between training (132 examples) and testing (56 examples).

In [119]:
X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/tesss.csv')

Run the following cell to print sentences from X_train and corresponding labels from Y_train.

  • Change the range to see different examples.
In [120]:
for idx in range(10):
    print(X_train[idx], label_to_emoji(Y_train[idx]))
never talk to me again 😞
I am proud of your achievements 😄
It is the worst day in my life 😞
Miss you so much ❤️
food is life 🍴
I love you mum ❤️
Stop saying bullshit 😞
congratulations on your acceptance 😄
The assignment is too long  😞
I want to go play ⚾

2 - Overview of the Emojifier-V1

In this part, we are going to implement a baseline model called "Emojifier-V1".

[Figure 2: the Emojifier-V1 baseline model (emojifierv1.png)]

Inputs and outputs

  • The input of the model is a string corresponding to a sentence (e.g. "I love you").
  • The output is a probability vector of shape (1,5), since there are 5 emojis to choose from.
  • The (1,5) probability vector is passed to an argmax layer, which extracts the index of the emoji with the highest probability (see the sketch below).
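
For orientation, the whole pipeline can be sketched in a few lines. This is only an illustration: it assumes the trained parameters W and b, and the word_to_vec_map, sentence_to_avg() and softmax() helpers introduced later in this notebook.

avg = sentence_to_avg("I love you", word_to_vec_map)  # average of the word vectors, shape (50,)
probs = softmax(np.dot(W, avg) + b)                   # probabilities over the 5 emoji classes
emoji_index = int(np.argmax(probs))                   # the "argmax layer"
print(label_to_emoji(emoji_index))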

One-hot encoding

  • To get our labels into a format suitable for training a softmax classifier, let's convert $Y$ from its current shape $(m, 1)$ into a "one-hot representation" of shape $(m, 5)$:
    • Each row is a one-hot vector giving the label of one example.
    • Here, Y_oh stands for "Y one-hot" in the variable names Y_oh_train and Y_oh_test:
In [121]:
Y_oh_train = convert_to_one_hot(Y_train, C = 5)
Y_oh_test = convert_to_one_hot(Y_test, C = 5)
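
convert_to_one_hot() comes from emo_utils; a minimal numpy equivalent (an assumption about what it computes, not its actual source) would be:

def convert_to_one_hot_sketch(Y, C):
    # Row y of the identity matrix np.eye(C) is the one-hot vector for class y,
    # so indexing with the label array builds the (m, C) one-hot matrix.
    return np.eye(C)[Y.reshape(-1)]

(The same np.eye(5)[Y_train.reshape(-1)] trick appears in a sanity-check cell further below.)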

Let's see what convert_to_one_hot() did. Feel free to change idx to print out different values.

In [122]:
idx = 50
print(f"Sentence '{X_train[50]}' has label index {Y_train[idx]}, which is emoji {label_to_emoji(Y_train[idx])}", )
print(f"Label index {Y_train[idx]} in one-hot encoding format is {Y_oh_train[idx]}")
Sentence 'I missed you' has label index 0, which is emoji ❤️
Label index 0 in one-hot encoding format is [ 1.  0.  0.  0.  0.]

All the data is now ready to be fed into the Emojifier-V1 model. Let's implement the model!

3 - Implementing Emojifier-V1

As shown in Figure 2 (above), the first step is to:

  • Convert each word in the input sentence into its word vector representation.
  • Then take an average of those word vectors.
  • We will use pre-trained 50-dimensional GloVe embeddings.

Run the following cell to load the word_to_vec_map, which contains all the vector representations.

In [123]:
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('../../readonly/glove.6B.50d.txt')

We have loaded:

  • word_to_index: dictionary mapping from words to their indices in the vocabulary
    • (400,001 words, with valid indices ranging from 0 to 400,000)
  • index_to_word: dictionary mapping from indices to their corresponding words in the vocabulary
  • word_to_vec_map: dictionary mapping words to their GloVe vector representations

Run the following cell to check that it works.

In [124]:
word = "cucumber"
idx = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(idx) + "th word in the vocabulary is", index_to_word[idx])
the index of cucumber in the vocabulary is 113317
the 289846th word in the vocabulary is potatos

Implement sentence_to_avg(). We will need to carry out the following steps:

  1. Convert every sentence to lower-case, then split the sentence into a list of words.
    • X.lower() and X.split() might be useful.
  2. For each word in the sentence, access its GloVe representation, then take the average of all of these word vectors.
    • numpy.zeros() might be useful.
  3. When creating the avg array of zeros, we want it to have the same shape as the other word vectors in word_to_vec_map.
    • We can choose any word that exists in word_to_vec_map and access its .shape field (see the one-line illustration below).
    • Alternatively, we can use any one of the word vectors retrieved from the input sentence to find the shape of a word vector.
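
For example (a one-line illustration; "the" is just an arbitrary word assumed to be in the vocabulary):

avg = np.zeros(word_to_vec_map["the"].shape)  # a zero vector with the same shape as a 50-d GloVe vector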
In [125]:
def sentence_to_avg(sentence, word_to_vec_map):
    """
    Converts a sentence (string) into a list of words (strings). Extracts the GloVe representation of each word
    and averages their values into a single vector encoding the meaning of the sentence.
    
    Arguments:
    sentence -- string, one training example from X
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    
    Returns:
    avg -- average vector encoding information about the sentence, numpy-array of shape (50,)
    """
    
    # Step 1: Split sentence into list of lower case words (≈ 1 line)
    words = (sentence.lower()).split()

    # Initialize the average word vector; it should have the same shape as the word vectors.
    avg = np.zeros(word_to_vec_map[words[0]].shape)
    
    # Step 2: average the word vectors. We can loop over the words in the list "words".
    total = 0
    for w in words:
        total += word_to_vec_map[w]
    avg = total/len(words)
    
    return avg
In [126]:
avg = sentence_to_avg("Morrocan couscous is my favorite dish", word_to_vec_map)
print("avg = \n", avg)
avg = 
 [-0.008005    0.56370833 -0.50427333  0.258865    0.55131103  0.03104983
 -0.21013718  0.16893933 -0.09590267  0.141784   -0.15708967  0.18525867
  0.6495785   0.38371117  0.21102167  0.11301667  0.02613967  0.26037767
  0.05820667 -0.01578167 -0.12078833 -0.02471267  0.4128455   0.5152061
  0.38756167 -0.898661   -0.535145    0.33501167  0.68806933 -0.2156265
  1.797155    0.10476933 -0.36775333  0.750785    0.10282583  0.348925
 -0.27262833  0.66768    -0.10706167 -0.283635    0.59580117  0.28747333
 -0.3366635   0.23393817  0.34349183  0.178405    0.1166155  -0.076433
  0.1445417   0.09808667]

Model

We now have all the pieces to finish implementing the model() function. After using sentence_to_avg() we need to:

  • Pass the average through forward propagation
  • Compute the cost
  • Backpropagate to update the softmax parameters

Implement the model() function described in Figure 2.

  • The equations we need to implement in the forward pass and to compute the cross-entropy cost are below:
  • The variable $Y_{oh}$ ("Y one hot") is the one-hot encoding of the output labels.
$$ z^{(i)} = W \cdot avg^{(i)} + b $$
$$ a^{(i)} = softmax(z^{(i)}) $$
$$ \mathcal{L}^{(i)} = - \sum_{k = 0}^{n_y - 1} Y_{oh,k}^{(i)} \log(a^{(i)}_k) $$
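
Here softmax() is imported from emo_utils. A typical numerically stable implementation (an assumption about its internals, not the actual emo_utils source) looks like this:

def softmax_sketch(z):
    # Subtracting the max leaves the result unchanged (softmax is invariant
    # to adding a constant to every logit) but avoids overflow in np.exp.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)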
In [127]:
def model(X, Y, word_to_vec_map, learning_rate = 0.01, num_iterations = 400):
    """
    Model to train word vector representations in numpy.
    
    Arguments:
    X -- input data, numpy array of sentences as strings, of shape (m, 1)
    Y -- labels, numpy array of integers between 0 and 4, numpy-array of shape (m, 1)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    learning_rate -- learning_rate for the stochastic gradient descent algorithm
    num_iterations -- number of iterations
    
    Returns:
    pred -- vector of predictions, numpy-array of shape (m, 1)
    W -- weight matrix of the softmax layer, of shape (n_y, n_h)
    b -- bias of the softmax layer, of shape (n_y,)
    """
    
    # Define number of training examples
    m = Y.shape[0]                          # number of training examples
    n_y = 5                                 # number of classes  
    n_h = 50                                # dimensions of the GloVe vectors 
    
    # Initialize parameters using Xavier initialization
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))
    
    # Convert Y to Y_onehot with n_y classes
    Y_oh = convert_to_one_hot(Y, C = n_y) 
    
    # Optimization loop
    for t in range(num_iterations): # Loop over the number of iterations
        for i in range(m):          # Loop over the training examples
            
            # Average the word vectors of the words from the i'th training example
            avg = sentence_to_avg(X[i], word_to_vec_map)

            # Forward propagate the avg through the softmax layer
            z = np.dot(W,avg)+b
            a = softmax(z)

            # Compute cost using the i'th training label's one-hot representation and "a" (the output of the softmax)
            cost = -np.sum(Y_oh[i] * np.log(a))
            
            # Compute gradients 
            dz = a - Y_oh[i]
            dW = np.dot(dz.reshape(n_y,1), avg.reshape(1, n_h))
            db = dz

            # Update parameters with Stochastic Gradient Descent
            W = W - learning_rate * dW
            b = b - learning_rate * db
        
        if t % 100 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            pred = predict(X, Y, W, b, word_to_vec_map) #predict is defined in emo_utils.py

    return pred, W, b
In [128]:
print(X_train.shape)
print(Y_train.shape)
print(np.eye(5)[Y_train.reshape(-1)].shape)
print(X_train[0])
print(type(X_train))
Y = np.asarray([5,0,0,5, 4, 4, 4, 6, 6, 4, 1, 1, 5, 6, 6, 3, 6, 3, 4, 4])
print(Y.shape)

X = np.asarray(['I am going to the bar tonight', 'I love you', 'miss you my dear',
 'Lets go party and drinks','Congrats on the new job','Congratulations',
 'I am so happy for you', 'Why are you feeling bad', 'What is wrong with you',
 'You totally deserve this prize', 'Let us go play football',
 'Are you down for football this afternoon', 'Work hard play harder',
 'It is suprising how people can be dumb sometimes',
 'I am very disappointed','It is the best day in my life',
 'I think I will end up alone','My life is so boring','Good job',
 'Great so awesome'])

print(X.shape)
print(np.eye(5)[Y_train.reshape(-1)].shape)
print(type(X_train))
(132,)
(132,)
(132, 5)
never talk to me again
<class 'numpy.ndarray'>
(20,)
(20,)
(132, 5)
<class 'numpy.ndarray'>

Run the next cell to train the model and learn the softmax parameters (W, b).

In [129]:
pred, W, b = model(X_train, Y_train, word_to_vec_map)
print(pred)
Epoch: 0 --- cost = 1.73936871
Accuracy: 0.378787878788
Epoch: 100 --- cost = 0.06424255
Accuracy: 0.924242424242
Epoch: 200 --- cost = 0.04006995
Accuracy: 0.962121212121
Epoch: 300 --- cost = 0.03287496
Accuracy: 0.969696969697
[[ 3.]
 [ 2.]
 [ 3.]
 [ 0.]
 [ 4.]
 [ 0.]
 [ 3.]
 [ 2.]
 [ 3.]
 [ 1.]
 [ 3.]
 [ 3.]
 [ 1.]
 [ 3.]
 [ 2.]
 [ 3.]
 [ 2.]
 [ 3.]
 [ 1.]
 [ 2.]
 [ 3.]
 [ 0.]
 [ 2.]
 [ 2.]
 [ 2.]
 [ 1.]
 [ 4.]
 [ 3.]
 [ 3.]
 [ 4.]
 [ 0.]
 [ 3.]
 [ 4.]
 [ 2.]
 [ 0.]
 [ 3.]
 [ 2.]
 [ 2.]
 [ 3.]
 [ 4.]
 [ 2.]
 [ 2.]
 [ 0.]
 [ 2.]
 [ 3.]
 [ 0.]
 [ 3.]
 [ 2.]
 [ 4.]
 [ 3.]
 [ 0.]
 [ 3.]
 [ 3.]
 [ 3.]
 [ 4.]
 [ 2.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 2.]
 [ 3.]
 [ 1.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 3.]
 [ 4.]
 [ 4.]
 [ 2.]
 [ 2.]
 [ 1.]
 [ 2.]
 [ 0.]
 [ 3.]
 [ 2.]
 [ 2.]
 [ 0.]
 [ 3.]
 [ 3.]
 [ 1.]
 [ 2.]
 [ 1.]
 [ 2.]
 [ 2.]
 [ 4.]
 [ 3.]
 [ 3.]
 [ 2.]
 [ 4.]
 [ 0.]
 [ 0.]
 [ 3.]
 [ 3.]
 [ 3.]
 [ 3.]
 [ 2.]
 [ 0.]
 [ 1.]
 [ 2.]
 [ 3.]
 [ 0.]
 [ 2.]
 [ 2.]
 [ 2.]
 [ 3.]
 [ 2.]
 [ 2.]
 [ 2.]
 [ 4.]
 [ 1.]
 [ 1.]
 [ 3.]
 [ 3.]
 [ 4.]
 [ 1.]
 [ 2.]
 [ 1.]
 [ 1.]
 [ 3.]
 [ 1.]
 [ 0.]
 [ 4.]
 [ 0.]
 [ 3.]
 [ 3.]
 [ 4.]
 [ 4.]
 [ 1.]
 [ 4.]
 [ 3.]
 [ 0.]
 [ 2.]]

Great! Our model has pretty high accuracy on the training set. Let's now see how it does on the test set.

4 - Examining test set performance

  • Note that the predict function used here is defined in emo_utils.py.
In [130]:
print("Training set:")
pred_train = predict(X_train, Y_train, W, b, word_to_vec_map)
print('Test set:')
pred_test = predict(X_test, Y_test, W, b, word_to_vec_map)
Training set:
Accuracy: 0.977272727273
Test set:
Accuracy: 0.875
  • Random guessing would give 20% accuracy, given that there are 5 classes (1/5 = 20%).
  • This is pretty good performance after training on only 132 examples.

The model matches emojis to relevant words

In the training set, the algorithm saw the sentence

"I love you"

with the label ❤️.

  • You can check that the word "adore" does not appear in the training set.
  • Nonetheless, let's see what happens if you write "I adore you."
In [131]:
X_my_sentences = np.array(["i adore you", "i love you", "funny lol", "lets play with a ball", "food is ready", "not feeling happy"])
Y_my_labels = np.array([[0], [0], [2], [1], [4],[3]])

pred = predict(X_my_sentences, Y_my_labels , W, b, word_to_vec_map)
print_predictions(X_my_sentences, pred)
Accuracy: 0.833333333333

i adore you ❤️
i love you ❤️
funny lol 😄
lets play with a ball ⚾
food is ready 🍴
not feeling happy 😄

Amazing!

  • Because adore has an embedding similar to love, the algorithm generalizes correctly even to a word it has never seen before.
  • Words such as heart, dear, beloved or adore have embedding vectors similar to love; the quick check below illustrates this.
    • Feel free to modify the inputs above and try out a variety of input sentences.
    • How well does it work?
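
As a quick check (computed inline with numpy, since this notebook does not define a cosine-similarity helper):

u = word_to_vec_map["adore"]
v = word_to_vec_map["love"]
# Cosine similarity is close to 1 for words with similar embeddings.
print(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))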

Word ordering isn't considered in this model

  • Note that the model doesn't get the following sentence correct:

    "not feeling happy"

  • This algorithm ignores word ordering, so it is not good at understanding phrases like "not happy." This is easy to verify, as shown below.
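
Because sentence_to_avg() simply averages word vectors, any permutation of the same words produces exactly the same sentence representation:

v1 = sentence_to_avg("not feeling happy", word_to_vec_map)
v2 = sentence_to_avg("happy feeling not", word_to_vec_map)
print(np.allclose(v1, v2))  # True -- word order is invisible to the model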

Confusion matrix

  • Printing the confusion matrix can also help understand which classes are more difficult for our model.
  • A confusion matrix shows how often an example whose label is one class ("actual" class) is mislabeled by the algorithm with a different class ("predicted" class).
In [132]:
print(Y_test.shape)
print('           '+ label_to_emoji(0)+ '    ' + label_to_emoji(1) + '    ' +  label_to_emoji(2)+ '    ' + label_to_emoji(3)+'   ' + label_to_emoji(4))
print(pd.crosstab(Y_test, pred_test.reshape(56,), rownames=['Actual'], colnames=['Predicted'], margins=True))
plot_confusion_matrix(Y_test, pred_test)
(56,)
           ❤️    ⚾    😄    😞   🍴
Predicted  0.0  1.0  2.0  3.0  4.0  All
Actual                                 
0            6    0    0    1    0    7
1            0    8    0    0    0    8
2            2    0   16    0    0   18
3            1    1    2   12    0   16
4            0    0    0    0    7    7
All          9    9   18   13    7   56